ABSTRACT



Missing data is a common problem in virtually all surveys. 
This study focuses on variance estimation and its consequences for analysis 
of survey data from the National Center for Education Statistics (NCES) . 
Methods suggested by C. Sarndal (1992), S. Kaufman (1996), and J. Shao and R.
Sitter (1996) are reviewed in detail. In section 3, the bootstrap method of
Shao and Sitter is applied to the Schools and Staffing Survey (SASS) 1993-94 
Public School Teacher Survey component to assess the magnitude of imputation 
variance. This method is appealing but requires repeated imputations, so for
large-scale surveys the data files become too large. The empirical study
shows that using the hot deck imputation method in the 1993-94 SASS can
seriously affect the standard error. However, the majority of items have
very low stage 2 (hot deck) imputation rates, and when the imputation rate is
low, the inflation in variance is not severe. It appears feasible for NCES to
compute imputation rates and document the problem in the next user's
manual. (Contains 8 tables and 11 references.) (SLD)
1. Introduction

Missing data is a common problem in virtually all surveys. In cross-sectional surveys,
missing data may mean no responses are obtained for a whole unit being surveyed (unit
nonresponse), or that responses are obtained for some of the items for a unit but not for 
other items (item nonresponse). In panel or longitudinal surveys, the data may be missing 
in more complex patterns. For example, a unit may respond to one wave of a survey but 
not respond to other waves (wave nonresponse). 

Unit and item nonresponse cause a variety of problems for survey analysts. Missing data 
can contribute to bias in the estimates and make the analyses harder to conduct and results 
harder to present. The most commonly used method for compensating for unit 
nonresponse in National Center for Education Statistics surveys is to adjust the weights of
the respondents so that survey analysts can use the observed data to make inferences for
the entire target population. The most frequently used method to compensate for item 
nonresponse in NCES surveys is imputation. Imputation consists of replacing the missing 
data item with a value that is either taken directly from a value reported by another 
respondent in the same survey or derived indirectly using a model that relates
nonrespondents to respondents.

In practice, imputed values are often treated as if they were true values. This procedure is
appropriate for developing estimates of totals, means, proportions, and most other
estimates of first-order population quantities like quantiles, if the imputation does not
cause serious systematic bias. However, to estimate the variance of these estimators when
there are imputed data, it is no longer appropriate to use the standard formulae. As early as
the 1950s, Hansen, Hurwitz, and Madow (1953) recognized that treating imputed values
as observed values can lead to underestimating variances of these estimators if standard
formulae are used. This underestimation may become more appreciable as the proportion
of imputed items increases.
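As a minimal illustration of why the standard formulae understate the variance, the following sketch applies respondent-mean imputation to a small hypothetical sample. The data values and the function name `naive_variance_of_mean` are assumptions made for this example, and a simple random sample without finite population correction is assumed. Because the imputed values sit exactly at the respondent mean, the sample variance computed from the completed data shrinks by the factor (r - 1)/(n - 1), where r is the number of respondents and n the sample size.

```python
from statistics import mean, variance


def naive_variance_of_mean(values):
    """Textbook variance estimate of a sample mean: s^2 / n (no fpc)."""
    return variance(values) / len(values)


# Hypothetical responses from r = 6 respondents in a sample of n = 10.
observed = [12.0, 15.0, 9.0, 14.0, 11.0, 17.0]
n = 10
r = len(observed)

# Respondent-mean imputation: every missing item gets the respondent mean.
imputed = observed + [mean(observed)] * (n - r)

s2_respondents = variance(observed)
s2_imputed = variance(imputed)

# The imputed values add no spread, so s^2 shrinks by (r - 1) / (n - 1).
shrink = s2_imputed / s2_respondents
print(shrink, (r - 1) / (n - 1))

# Treating imputed values as observed makes the estimate look more precise.
print(naive_variance_of_mean(imputed) < naive_variance_of_mean(observed))
```

The comparison in the last line is the underestimation described above: the naive formula reports a smaller variance for the completed data than for the respondents alone, even though no new information was collected.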

Analysts have developed a number of procedures to handle variance estimation of
imputed survey data. In particular, Rubin (1987) proposed a multiple imputation
procedure to estimate the variance due to imputation by replicating the process a number
of times and estimating the between-replicate variation. This multiple imputation
procedure, however, may not lead to consistent variance estimators for stratified
multistage surveys in the common situation of imputation cutting across sample clusters
(Fay, 1991). Moreover, multiple imputation requires maintaining multiple complete data
sets, which is operationally difficult, especially in large-scale surveys. More recently,
Sarndal (1992) outlined a number of model-assisted estimators of variance, while Rao
and Shao (1992) proposed a technique that adjusts the imputed values to correct the usual
or naive jackknife variance estimator for hot deck imputation. The Sarndal and the Rao
and Shao methods are appealing in that they yield unbiased variance estimators and only
the imputed file (with the imputed fields flagged) is required for variance estimation.
Kaufman (1996) proposed a variance estimation method similar to Sarndal’s method that
can be used with a nearest neighbor imputation approach. Shao and Sitter (1996)
proposed performing an imputation procedure on each bootstrap sub-sample to
incorporate the imputation variability. They argue that this bootstrap procedure is consistent
irrespective of the sampling design, the imputation method, or the type of statistic used in
inference, and that it is the only method proposed thus far that works without any restriction
on the sampling design, the imputation method, or the type of statistic.

This research focuses on variance estimation and its consequences for analysts of NCES 
survey data. In section 2, Sarndal’s method, Kaufman’s method, and Shao and Sitter’s 
method are reviewed in more detail. In section 3, Shao and Sitter’s bootstrap method is 
applied to the SASS 1993-94 Public School Teacher Survey component to assess the 
magnitude of imputation variance. 



2. Literature Review 

Three types of approaches to variance estimation in the presence of imputation are
reviewed in this section: Sarndal’s model-assisted approach (Sarndal, 1992), Kaufman’s
method that can be used when a nearest neighbor imputation approach is taken (Kaufman,
1996), and Shao and Sitter’s bootstrap variance estimation method (Shao and Sitter,
1996). Shao and Sitter’s method will also be applied to the SASS Public School Teacher
component in the next section.

2.1 Sarndal’s Model-Assisted Method

Sarndal (1992) proposed a model-assisted method. The model-assisted approach uses the
fact that most imputation methods have an underlying model which justifies the imputed
values. Let $U = \{1, \ldots, k, \ldots, N\}$ be the index set of a finite population. A probability
sample $s$ is selected from $U$ with a given sampling design. Let $r$ be the set of respondents,
and $nr$ the set of nonrespondents. The variable of interest is denoted by $y$. We are
interested in the estimation of the population total $t = \sum_{U} y_k$. The data after imputation
consist of the values denoted $y_{\bullet k}$, $k \in s$, such that

$$y_{\bullet k} = \begin{cases} y_k, & \text{if } k \in r \\ y_{imp,k}, & \text{if } k \in nr \end{cases}$$

where $y_k$ is an actually observed value, and $y_{imp,k}$ denotes the imputed value for the unit
$k$. In the case of 100 percent response, $\hat{t} = \sum_{k \in s} w_k y_k$, where $w_k$ is the weight given to the
observation $y_k$. When the data contain imputations, the estimator of $t$ is $\hat{t}_{\bullet} = \sum_{k \in s} w_k y_{\bullet k}$.
The total error of $\hat{t}_{\bullet}$ is decomposed as

$$\hat{t}_{\bullet} - t = (\hat{t} - t) + (\hat{t}_{\bullet} - \hat{t}).$$

The mean squared error (MSE) of $\hat{t}_{\bullet}$, denoted by $V$, is

$$V = E_p E_q (\hat{t}_{\bullet} - t)^2 = V_{SAM} + V_{IMP} + 2V_{MIX}.$$

Here $p(\cdot)$ denotes the sampling design, that is, $p(s)$ is the known probability of realizing
the sample $s$, and $q(\cdot \mid s)$ the response mechanism, that is, $q(r \mid s)$ is the (unknown) conditional
probability that the response set $r$ is realized. It is assumed that $q(\cdot \mid s)$ is an unconfounded
mechanism. That is, it may depend on the covariate values $\{x_k : k \in s\}$, but not on the
values $\{y_k : k \in s\}$ of the variable of interest. $V_{SAM}$ is the sampling variance, $V_{IMP}$ is the
imputation variance, and $V_{MIX}$ is the mixed term which measures the covariance between
the sampling error and the imputation error. The components in the mean squared error
(MSE) of $\hat{t}_{\bullet}$ are hard to estimate unless a model for the relationship between $x$ and $y$ is
brought in to assist the procedure. An example of such a model is:

$$y_k = \beta x_k + \varepsilon_k \quad \text{for } k \in U.$$

The $\varepsilon_k$ are random variables with $E_\xi(\varepsilon_k) = 0$, $E_\xi(\varepsilon_k^2) = \sigma^2 x_k$, and $E_\xi(\varepsilon_k \varepsilon_l) = 0$ for all
$k \neq l$, where $E_\xi$ denotes expectation with respect to the model. The anticipated MSE (that is,
the model expectation of the MSE) can be written as

$$E_\xi(V) = E_\xi(V_{SAM}) + E_p E_q\left[E_\xi\{(\hat{t}_{\bullet} - \hat{t})^2 \mid s, r\}\right] + 2E_p E_q\left[E_\xi\{(\hat{t} - t)(\hat{t}_{\bullet} - \hat{t}) \mid s, r\}\right].$$

The $\xi$-expectations appearing in the true variance components can be evaluated without
difficulty, leading to expressions which depend on known $x_k$-values and on the
unknown model parameters $\beta$ and $\sigma$. The unconfoundedness of the nonresponse
mechanism ensures that the order of expectation operators $E_\xi$ and $E_p E_q$ can be reversed.



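Construct $\hat{V}_{SAM}$, $\hat{V}_{IMP}$, and $\hat{V}_{MIX}$ such that

$$E_\xi\{E_p E_q(\hat{V}_{SAM}) - V_{SAM}\} = 0,$$

$$E_\xi\{E_p E_q(\hat{V}_{IMP}) - V_{IMP}\} = 0,$$

$$E_\xi\{E_p E_q(\hat{V}_{MIX}) - V_{MIX}\} = 0,$$

then

$$\hat{V} = \hat{V}_{SAM} + \hat{V}_{IMP} + 2\hat{V}_{MIX}$$

is anticipated to be unbiased for $V$. That is,

$$E_\xi\{E_p E_q(\hat{V}) - V\} = 0.$$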

Sarndal (1992), Lee, Rancourt, and Sarndal (1995), and Rancourt, Sarndal, and Lee
(1994) applied this approach to four different imputation methods: respondent mean
imputation, hot-deck imputation, ratio imputation, and nearest neighbor imputation.

With Sarndal’s method, the total variance can be estimated without multiple imputation,
but an explicit model for the relationship between the auxiliary variable x and y is needed to
assist the procedure. Therefore, if the imputation method is hard to model or if there is
not enough evidence to assume a model for the relationship between y and x, this
procedure is hard to implement. In addition, unconfoundedness is often satisfied only by
assuming that the response mechanism does not depend on the y-values, which is not always
true.



2.2 Kaufman’s Method 



In practice, nearest neighbor imputation is often conducted in such a way that, within 
each imputation cell, sampling units are sorted so that two nearest neighbors can be 
identified for each missing case: one in ascending order and another in descending order, 
for example. Let r be the set of responding units and nr be the set of nonresponding units. 
Kaufman (1996) considered the following imputation set-up: for each $k \in nr$, one of the
two nearest neighbors (donors) is randomly selected and assigned to the missing item.
That is,

$$y_{\bullet k} = \begin{cases} y_k, & \text{if } k \in r \\ I_k y_{k1} + (1 - I_k) y_{k2}, & \text{if } k \in nr. \end{cases}$$

Here $I_k$ is a random variable with $P(I_k = 1) = P(I_k = 0) = 0.5$, $y_{k1}$ is the value of the first
donor, $y_{k2}$ is the value of the second donor, and $y_k$ is the observed value for $k \in r$. Let
$t_y = \sum_{k \in U} y_k$ be the population total of variable $y$. If all sampled units are observed, an
unbiased estimator of $t_y$ is

$$\hat{y} = \sum_{k \in s} w_k y_k.$$

Here $w_k$ is the design weight (inverse of inclusion probability). If the data are imputed
and if we can assume the imputation does not cause very much systematic bias, then it is
appropriate to use the customary estimator of $t_y$

$$\hat{y}_{\bullet} = \sum_{k \in s} w_k y_{\bullet k}.$$

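In this section, we first derive the mean squared error of the underlying estimator $\hat{y}_{\bullet}$,
denoted by $\mathrm{MSE}(\hat{y}_{\bullet})$, then we derive the variance of $\hat{y}_{\bullet}$, denoted by $V(\hat{y}_{\bullet})$, both under
Kaufman’s imputation set-up. We also discuss Kaufman’s approach of deriving $V(\hat{y}_{\bullet})$.
For the mean squared error of $\hat{y}_{\bullet}$, notice

$$\begin{aligned}
\mathrm{MSE}(\hat{y}_{\bullet}) &= E(\hat{y}_{\bullet} - t_y)^2 \\
&= E(\hat{y}_{\bullet} - \hat{y} + \hat{y} - t_y)^2 \\
&= E(\hat{y}_{\bullet} - \hat{y})^2 + E(\hat{y} - t_y)^2 + 2E[(\hat{y}_{\bullet} - \hat{y})(\hat{y} - t_y)],
\end{aligned}$$

and $E(\hat{y} - t_y)^2 = V(\hat{y})$, $\mathrm{COV}[(\hat{y}_{\bullet} - \hat{y}), \hat{y}] = \mathrm{COV}[(\hat{y}_{\bullet} - \hat{y}), (\hat{y} - t_y)] = E[(\hat{y}_{\bullet} - \hat{y})(\hat{y} - t_y)]$
(because $\hat{y}$ is unbiased for $t_y$), hence

$$\mathrm{MSE}(\hat{y}_{\bullet}) = E(\hat{y}_{\bullet} - \hat{y})^2 + V(\hat{y}) + 2\,\mathrm{COV}[(\hat{y}_{\bullet} - \hat{y}), \hat{y}]. \tag{1}$$

Here $E(\hat{y}_{\bullet} - \hat{y})^2$ is the imputation variance, $V(\hat{y})$ is the sampling variance, and
$\mathrm{COV}[(\hat{y}_{\bullet} - \hat{y}), \hat{y}]$ is the covariance between the sampling error and the imputation error.
However, to estimate the components on the right-hand side of $\mathrm{MSE}(\hat{y}_{\bullet})$, we need an
explicit model for the relationship between $y$ and some auxiliary variables. Under
Kaufman’s imputation set-up, $\mathrm{COV}[(\hat{y}_{\bullet} - \hat{y}), \hat{y}]$ can be written in a slightly different form.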

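Notice

$$\mathrm{COV}(\hat{y}_{\bullet} - \hat{y}, \hat{y}) = \mathrm{COV}_1[E_I(\hat{y}_{\bullet} - \hat{y}), \hat{y}] + E_1[\mathrm{COV}_I(\hat{y}_{\bullet} - \hat{y}, \hat{y})].$$

Here subscript 1 is with respect to the sampling design and nonresponse mechanism, and
subscript $I$ is with respect to donor selection. Also notice

$$\begin{aligned}
E_I(\hat{y}_{\bullet} - \hat{y}) &= E_I(\hat{y}_{\bullet}) - \hat{y} \\
&= \sum_{k \in r} w_k y_k + \tfrac{1}{2}\sum_{k \in nr} w_k y_{k1} + \tfrac{1}{2}\sum_{k \in nr} w_k y_{k2} - \hat{y} \\
&= \tfrac{1}{2}(\hat{y}_1 + \hat{y}_2) - \hat{y} \\
&= \bar{y}_{\bullet} - \hat{y},
\end{aligned}$$

where $\hat{y}_1 = \sum_{k \in r} w_k y_k + \sum_{k \in nr} w_k y_{k1}$, $\hat{y}_2 = \sum_{k \in r} w_k y_k + \sum_{k \in nr} w_k y_{k2}$, and
$\bar{y}_{\bullet} = \tfrac{1}{2}(\hat{y}_1 + \hat{y}_2)$ is the expectation of $\hat{y}_{\bullet}$ over donor selection. Moreover,

$$\begin{aligned}
\mathrm{COV}_I(\hat{y}_{\bullet} - \hat{y}, \hat{y}) &= E_I[(\hat{y}_{\bullet} - \hat{y})\hat{y}] - E_I(\hat{y}_{\bullet} - \hat{y})E_I(\hat{y}) \\
&= [E_I(\hat{y}_{\bullet}) - \hat{y}]\hat{y} - [E_I(\hat{y}_{\bullet}) - \hat{y}]\hat{y} \\
&= 0,
\end{aligned}$$

so $\mathrm{COV}(\hat{y}_{\bullet} - \hat{y}, \hat{y}) = \mathrm{COV}_1[\bar{y}_{\bullet} - \hat{y}, \hat{y}]$. Therefore

$$\mathrm{MSE}(\hat{y}_{\bullet}) = E(\hat{y}_{\bullet} - \hat{y})^2 + V(\hat{y}) + 2\,\mathrm{COV}_1[(\bar{y}_{\bullet} - \hat{y}), \hat{y}]. \tag{2}$$

$\mathrm{MSE}(\hat{y}_{\bullet})$ can also be decomposed in the following way:

$$\begin{aligned}
\mathrm{MSE}(\hat{y}_{\bullet}) &= E(\hat{y}_{\bullet} - t_y)^2 \\
&= E(\hat{y}_{\bullet} - E(\hat{y}_{\bullet}) + E(\hat{y}_{\bullet}) - t_y)^2 \\
&= V(\hat{y}_{\bullet}) + [E(\hat{y}_{\bullet}) - t_y]^2.
\end{aligned}$$

Here $V(\hat{y}_{\bullet})$ is the variance of the underlying estimator $\hat{y}_{\bullet}$, and $E(\hat{y}_{\bullet}) - t_y$ is the bias of
$\hat{y}_{\bullet}$. The bias $E(\hat{y}_{\bullet}) - t_y = E(\hat{y}_{\bullet} - \hat{y})$ can be estimated with an assisting model $\xi$ in the following
way. First evaluate the conditional expectation for the given sample $s$ and respondents $r$:

$$E_\xi(\hat{y}_{\bullet} - \hat{y} \mid s, r) = d_\xi.$$

Then, for the given $s$ and $r$, find a model unbiased estimator $\hat{d}_\xi$ for $d_\xi$. Here again we
need a model and the assumption of unconfoundedness. The other component, $V(\hat{y}_{\bullet})$, the
variance of $\hat{y}_{\bullet}$, is obtained as follows:

$$V(\hat{y}_{\bullet}) = V(\hat{y}_{\bullet} - \hat{y} + \hat{y}) = V(\hat{y}_{\bullet} - \hat{y}) + V(\hat{y}) + 2\,\mathrm{COV}[(\hat{y}_{\bullet} - \hat{y}), \hat{y}]. \tag{3}$$

Since $\mathrm{COV}(\hat{y}_{\bullet} - \hat{y}, \hat{y}) = \mathrm{COV}_1[\bar{y}_{\bullet} - \hat{y}, \hat{y}]$, $V(\hat{y}_{\bullet})$ can be written in a slightly different form:

$$V(\hat{y}_{\bullet}) = V(\hat{y}_{\bullet} - \hat{y}) + V(\hat{y}) + 2\,\mathrm{COV}_1[(\bar{y}_{\bullet} - \hat{y}), \hat{y}]. \tag{4}$$

The right-hand side can be written in a form only with respect to the sampling design and
response mechanism. Notice

$$V(\hat{y}) = V_1(\hat{y}),$$

$$V(\hat{y}_{\bullet} - \hat{y}) = V_1(\bar{y}_{\bullet} - \hat{y}) + E_1\Big[\tfrac{1}{4}\sum_{k \in nr} w_k^2 (y_{k1} - y_{k2})^2\Big],$$

where the second term is the donor selection variance, which follows from $\mathrm{Var}(I_k) = 1/4$
for donor indicators drawn independently across nonrespondents. Therefore,

$$V(\hat{y}_{\bullet}) = V_1(\hat{y}) + V_1(\bar{y}_{\bullet} - \hat{y}) + E_1\Big[\tfrac{1}{4}\sum_{k \in nr} w_k^2 (y_{k1} - y_{k2})^2\Big] + 2\,\mathrm{COV}_1(\bar{y}_{\bullet} - \hat{y}, \hat{y}). \tag{5}$$

To estimate the variance components on the right side, however, we have to resort to an
explicit model for the relationship between $y$ and some auxiliary variables, as in Sarndal’s
approach.

Another decomposition of $V(\hat{y}_{\bullet})$ is simpler and probably more interesting. We
decompose $V(\hat{y}_{\bullet})$ into two parts: the sampling variance of the expected imputation value
and the sampling expectation of the donor selection variance:

$$\begin{aligned}
V(\hat{y}_{\bullet}) &= V_1 E_I(\hat{y}_{\bullet}) + E_1 V_I(\hat{y}_{\bullet}) \\
&= V_1\big[\tfrac{1}{2}(\hat{y}_1 + \hat{y}_2)\big] + E_1 V_I\Big[\sum_{k \in nr} w_k \big(y_{k1} I_k + y_{k2}(1 - I_k)\big)\Big] \\
&= V_1(\bar{y}_{\bullet}) + E_1\Big[\tfrac{1}{4}\sum_{k \in nr} w_k^2 (y_{k1} - y_{k2})^2\Big].
\end{aligned} \tag{6}$$

Again, however, we need a model to estimate $V_1(\bar{y}_{\bullet})$.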
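The donor selection term can be checked numerically. The sketch below is only an illustration under stated assumptions: the weights and donor values are hypothetical, the function names are invented for this example, and the donor indicators are assumed to be independent across nonrespondents with probability 0.5 each, as in the set-up above. It enumerates every possible donor assignment for a handful of nonrespondents and compares the resulting variance of the imputed part of the weighted total with the closed form, one quarter of the sum of squared weights times squared donor differences. Respondent values do not change under donor selection, so they are left out.

```python
from itertools import product


def donor_selection_variance(weights, donor1, donor2):
    """Closed form: (1/4) * sum_k w_k^2 * (y_k1 - y_k2)^2."""
    return 0.25 * sum(w * w * (a - b) ** 2
                      for w, a, b in zip(weights, donor1, donor2))


def enumerated_variance(weights, donor1, donor2):
    """Variance of the imputed part of the weighted total over all 2^m
    equally likely donor choices (1 picks donor 1, 0 picks donor 2)."""
    totals = []
    for choice in product((0, 1), repeat=len(weights)):
        totals.append(sum(w * (a if pick else b)
                          for w, a, b, pick
                          in zip(weights, donor1, donor2, choice)))
    center = sum(totals) / len(totals)
    return sum((t - center) ** 2 for t in totals) / len(totals)


# Hypothetical nonrespondents: design weights and their two donor values.
w = [10.0, 25.0, 40.0]
y1 = [3.0, 8.0, 5.0]
y2 = [7.0, 6.0, 5.0]

closed_form = donor_selection_variance(w, y1, y2)
brute_force = enumerated_variance(w, y1, y2)
print(closed_form, brute_force)
```

The third unit has identical donors and therefore contributes nothing, which matches the intuition that donor selection adds variance only where the two nearest neighbors disagree.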

Kaufman (1996) took another approach to estimate $V(\hat{y}_{\bullet})$. In Kaufman’s method, a
residual is defined for each $k \in s$:

$$d_k^R = \begin{cases} 0, & \text{if } k \in r \\ (2I_k - 1)(y_{j_k 1} - y_{j_k 2}), & \text{if } k \in nr \end{cases}$$

where $j_k$ is a missing item within missing item $k$’s imputation cell and $(y_{j_k 1} - y_{j_k 2})$ is the
difference between the two nearest neighbors (donors) of $j_k$. Missing item $j_k$ is selected
independently from $k$’s imputation cell with known selection probability, for example,
with selection probability proportional to design weights $w_k$. Then the residual is
attached to $y_{\bullet k}$ to form another quantity $\hat{Y}$, which is used for the purpose of variance
estimation:

$$\hat{Y} = \hat{y}_{\bullet} + d^R = \sum_{k \in s} w_k (y_{\bullet k} + d_k^R).$$

The variance of $\hat{Y}$ contains variability from both $\hat{y}_{\bullet}$ and $d^R$. It can be shown that
(Theorem 1 of the appendix)

$$V(\hat{Y}) = V(\hat{y}) + V(\hat{y} - \hat{y}_{\bullet}) + 2\,\mathrm{COV}_1(\bar{y}_{\bullet} - \hat{y}, \hat{y}) + E_1 V_2(d^R),$$

where $\bar{y}_{\bullet} = E_I(\hat{y}_{\bullet})$ and $V_2(d^R) = E_I V_R(d^R) + V_I E_R(d^R)$. According to formula (4) above or theorem 3 of the
appendix, we have

$$V(\hat{y}_{\bullet}) = V(\hat{Y}) - E_1 V_2(d^R). \tag{7}$$

Therefore, the variance of $\hat{y}_{\bullet}$ is the difference between $V(\hat{Y})$ and $E_1 V_2(d^R)$. Although the
estimator of $E_1 V_2(d^R)$ is often easy to find, the variance of $\hat{Y}$ is often hard to estimate,
unless it can be shown that the same variance estimator for $\hat{y}$ can be used or an explicit
model can help. Like Sarndal’s method, Kaufman’s method does not require multiple
imputation, but the estimator for $V(\hat{Y})$ may be hard to find and may need a model to assist
the variance estimation. In addition, Kaufman’s imputation method introduces a donor
selection variance component into the total variance, which in turn inflates the total
variance. Therefore, it is less efficient than Sarndal’s method. Nevertheless, this method
leads to the same decomposition as formulae (4) and (3) (theorem 3 of the appendix).



2.3 Shao and Sitter’s Method 



Shao and Sitter (1996) proposed a bootstrap method for variance estimation with imputed
data. Although they proved only that this method produces consistent bootstrap
estimators for mean, ratio, or regression (deterministic or random) imputations under
stratified multistage sampling, they believe that the proposed bootstrap is the only
method proposed thus far that works irrespective of the sampling design (single stage or
multistage, simple random sampling or stratified sampling), the imputation method
(random or nonrandom, proper or improper), or the type of estimator (smooth or
nonsmooth). The method is paraphrased as follows:

1) Draw a simple random sample $\{y_i^* : i = 1, \ldots, n\}$ with replacement from the original
imputed data set $Y_I = \{y_k : k \in A_R \text{ (respondents)},\ \eta_k : k \in A_M \text{ (nonrespondents)}\}$.

2) Let $Y_R^* = \{y_i^* : i \in A_R^*\}$ and $Y_M^* = \{y_i^* : i \in A_M^*\}$, where $A_R^*$ and $A_M^*$ denote the sets of
respondents and nonrespondents in the bootstrap sample. Apply the same imputation
procedure used in constructing $Y_I$ (using $Y_R^*$ to impute $Y_M^*$), and denote the
bootstrap analog of $Y_I$ by $Y_I^*$.

3) Obtain the bootstrap analog $\hat{\theta}_I^* = \hat{\theta}(Y_I^*)$ of $\hat{\theta}_I = \hat{\theta}(Y_I)$, based on the imputed
bootstrap data set $Y_I^*$.

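4) Repeat steps 1) - 3) $B$ times. Apply Monte Carlo approximations to obtain bootstrap
variance estimators for $\hat{\theta}_I$:

$$v_B(\hat{\theta}_I) = \frac{1}{B}\sum_{b=1}^{B}\big(\hat{\theta}_I^{*b} - \bar{\theta}^*\big)^2,$$

here $\bar{\theta}^* = B^{-1}\sum_{b=1}^{B}\hat{\theta}_I^{*b}$.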

Shao and Sitter’s method does not require any model or explicit variance formulae. Once 
the imputation procedure is programmed appropriately, Shao and Sitter’s method is easy 
to implement. However, since B imputations should be performed for each item, 
extensive computation is required for large scale surveys. Maintaining the large amount 
of imputed data can be operationally difficult. 
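The four steps can be sketched in a few lines for the simplest case. This is a minimal sketch under stated assumptions, not the procedure used for SASS: it assumes an unweighted simple random sample, uses respondent-mean imputation as a stand-in for the survey's imputation method, marks nonrespondents with `None`, and divides by B when averaging squared deviations. All function names are hypothetical.

```python
import random


def mean_impute(values):
    """Fill None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in values]


def sample_mean(values):
    return sum(values) / len(values)


def reimputation_bootstrap_variance(values, estimator, impute, B=200, seed=0):
    """Bootstrap variance with re-imputation inside every replicate.

    values: the sample, with None marking nonrespondents. Their imputed
    values are not carried along because step 2 re-imputes them anyway.
    """
    rng = random.Random(seed)
    n = len(values)
    replicates = []
    while len(replicates) < B:
        # Step 1: resample units (with their response status) with replacement.
        star = [values[rng.randrange(n)] for _ in range(n)]
        if all(v is None for v in star):
            continue  # no donors in this draw; redraw (sketch-level shortcut)
        # Steps 2 and 3: re-impute within the replicate, then estimate.
        replicates.append(estimator(impute(star)))
    # Step 4: Monte Carlo variance about the average of the replicates.
    center = sum(replicates) / B
    return sum((t - center) ** 2 for t in replicates) / B


# Hypothetical sample with three nonrespondents.
sample = [4.0, None, 7.0, 5.0, None, 9.0, 6.0, None, 8.0, 5.0]
print(reimputation_bootstrap_variance(sample, sample_mean, mean_impute))
```

The key design point is that `impute` is called inside the loop. Imputing once and then bootstrapping the completed data would treat the imputed values as observed, which is exactly the practice the method is meant to avoid. The loop also makes the operational cost visible: B separate imputations are needed for every item.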

3. An Empirical Study 

We chose Shao and Sitter’s method to assess the magnitude of imputation variance in the
SASS 1993-94 Public School Teacher Survey component based on the following
considerations: 1) the bootstrap method is used in SASS 1993-94 for variance estimation; 2)
we do not have any reliable model on hand to allow us to perform Sarndal’s or
Kaufman’s method; 3) the nearest neighbor imputation in Kaufman’s method involves random
donor selection, while the SASS 1993-94 imputation does not.




SASS 1993-94 Public School Teacher Survey data contains information on the 47,105 
public school teachers who responded to the survey. 

Four types of imputation methods are used in SASS 1993-94. They are (paraphrasing 
from Abramson et al., 1996, page 80): 

(1) using data from other items of the same unit on the questionnaire; 

(2) extracting data from a related component of SASS (for example, using data from a 
school record to impute missing values on the questionnaire for the LEA that operates 
the school); 

(3) extracting data from the frame file (the information about the sample case from the 
sampling frame: the Private School Survey or the Common Core of Data); 

(4) extracting data from the record for a sample case with similar characteristics (“hot
deck”).

In this study, we investigated imputation method (4), also called “stage 2 imputation.”
Methods (1), (2), and (3) are deductive imputation methods. In method (1), the imputed
values are from other observed items of the same unit and in method (3) the imputed
values are from the sampling frame file (PSS or CCD). For imputation method (2), the
LEA’s missing item is imputed through information from the sampled school which
belongs to that LEA. According to Abramson et al. (1996), this type of imputation was
performed only for one-school LEAs. Therefore, the values imputed by methods (1),
(2), or (3) are independent of the sample and the sample design. Assume the simplest
response mechanism: respondents always respond and nonrespondents never respond.
Then if the population is $\{y_1, y_2, \ldots, y_N\}$, the imputed values can be assumed to be
$\{z_1, z_2, \ldots, z_N\}$. Here if $y_k$ is actually observed, then $z_k = y_k$; otherwise $z_k$ equals the
value imputed by any of methods (1), (2), or (3). Let $t_y = \sum_{k=1}^{N} y_k$ be the population total
of $y$, $t_z = \sum_{k=1}^{N} z_k$ be the population total of $z$, and $\hat{t}_z = \sum_{k \in s} z_k / \pi_k$ be the Horvitz-Thompson
estimator of $t_z$ (here $\pi_k$ is the inclusion probability of unit $k$). We have the following
decomposition

$$\mathrm{MSE}(\hat{t}_z) = V(\hat{t}_z) + (t_z - t_y)^2.$$

The first part, $V(\hat{t}_z)$, can be estimated by treating the imputed values as observed values,
while the second part is the squared bias of the imputation; assessing this bias is outside the
scope of this study. If there is reason to believe the imputation bias is small, then treating
the values imputed by any of methods (1), (2), or (3) as observed values and using a
standard variance estimation formula will not underestimate the variance. Or, if we can
estimate the systematic bias caused by the imputation, the mean squared error of the
underlying estimator can then be estimated.
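The decomposition follows in one step once the imputed values are treated as fixed. A short derivation, using only the quantities defined in the text and the fact that the Horvitz-Thompson estimator is design-unbiased for the total of any fixed variable, is:

```latex
\begin{aligned}
\mathrm{MSE}(\hat{t}_z)
  &= E\,(\hat{t}_z - t_y)^2
   = E\,\{(\hat{t}_z - t_z) + (t_z - t_y)\}^2 \\
  &= E\,(\hat{t}_z - t_z)^2
     + 2\,(t_z - t_y)\,E\,(\hat{t}_z - t_z)
     + (t_z - t_y)^2 \\
  &= V(\hat{t}_z) + (t_z - t_y)^2 ,
\end{aligned}
\qquad \text{since } E(\hat{t}_z) = t_z .
```

The cross term vanishes only because the values $z_k$ do not depend on which sample is drawn. That is the property that separates the deductive methods (1) to (3) from the hot deck method (4).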

For method (4), however, the imputed data cannot be treated as observed values.
Every imputed value is a function of the sample, so the imputed values
cannot be represented as a set of fixed values such as $\{z_1, z_2, \ldots, z_N\}$.



SASS surveys are designed to produce reliable state estimates, and samples are selected 
systematically without replacement with large sampling rates within strata. To reflect the 
increase in precision due to large sampling rates, a without replacement bootstrap 
variance estimator procedure has been implemented for the 1993-94 SASS. Instead of 
drawing a simple random sample with replacement from the original sample, the 
bootstrap is done systematically without replacement with probability proportional to size 
as the original sampling was performed (Abramson et al., 1996). 



In SASS 1993-94 components, 48 replicate weights were created to estimate variance 
using the bootstrap method. These replicate weights were subjected to various 
adjustments, including a sampling adjustment, a noninterview adjustment, and a ratio 
adjustment. In order to reflect these adjustments, these replicate weights should be used in 
the variance estimation. To this end, we used Shao and Sitter’s method in the
following manner:

1) For each set of replicate weights $\{w_{ik}\}$ ($i = 1, 2, \ldots, 48$), cases with $w_{ik} = 0$ are
dropped. Denote the remaining cases, which make up a bootstrap sub-sample, as
$Y_{Ii} = \{y_k : k \in A_{Ri},\ \eta_k : k \in A_{Mi}\}$ ($i = 1, 2, \ldots, 48$). This corresponds to Shao and Sitter’s
step 1.

2) Apply the same imputation method as was used to create the full sample imputation
values and use $\{y_k : k \in A_{Ri}\}$ to impute $\{\eta_{ik} : k \in A_{Mi}\}$ ($i = 1, 2, \ldots, 48$). This
corresponds to Shao and Sitter’s step 2. This re-imputed bootstrap sub-sample is
denoted as $s_i$. That is,

$$s_i = \{y_k : k \in A_{Ri}\} \cup \{\eta_{ik} : k \in A_{Mi}\}.$$

The missing values in the full sample are also imputed by using the nonmissing
values in the full sample. This imputed full sample is denoted as $s_0$.
Thus, 48 imputed bootstrap sub-samples and 1 imputed full sample are
obtained.

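3) Calculate the $\hat{\theta}_i$ of interest from $s_i$, weighted by replicate weights $\{w_{ik}\}$ ($i = 1, \ldots, 48$),
and the $\hat{\theta}$ from full sample $s_0$, weighted by the full sample weight $\{w_k\}$. The
variance of $\hat{\theta}$ is estimated by

$$v(\hat{\theta}) = \frac{1}{48}\sum_{i=1}^{48}\big(\hat{\theta}_i - \hat{\theta}\big)^2.$$
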

Another difference between the variance estimator we used above and Shao and Sitter’s
estimator is that in our formula the deviation is around the full sample estimate $\hat{\theta}$
whereas in Shao and Sitter’s formula the deviation is around the average of the bootstrap
estimates $\bar{\theta}^*$, where $\bar{\theta}^* = B^{-1}\sum_{b=1}^{B}\hat{\theta}^{*b}$. The balanced repeated replication method (BRR) is implemented in
WesVar PC, but the bootstrap method is not. Abramson et al. (1996) suggest that with
any BRR software package, the BRR option should be specified for 1993-94 SASS data
analysis. The formula used in WesVar PC for the BRR option is the formula we used
above. Notice $E(\hat{\theta}^* - \hat{\theta})^2 = E_P E_B(\hat{\theta}^* - \hat{\theta})^2$. Here $E_P$ is with respect to
sample design, $E_B$ is with respect to bootstrap subsampling, and typically $E_B(\hat{\theta}^*) = \hat{\theta}$.

Therefore $E_B(\hat{\theta}^* - \hat{\theta})^2 = \mathrm{Var}_B(\hat{\theta}^*)$. An unbiased estimator of $\mathrm{Var}_B(\hat{\theta}^*)$ is
$(B - 1)^{-1}\sum_{b=1}^{B}(\hat{\theta}^{*b} - \bar{\theta}^*)^2$.
When $B$ is large the bias in variance estimation is small and can be easily corrected by the
factor $(B - 1)/B$. In our study, we compare standard error estimates instead of variance
estimates and $B = 48$, so the adjustment factor is $\sqrt{47/48} \approx 0.99$. We do not apply this
adjustment because it is close to 1. In addition, we use the same formula to calculate both
the standard error estimates incorporating imputation variance and the standard error
estimates not incorporating imputation variance, and the ratio of these two types of
standard error estimates is used as the measure of the difference. Therefore, the
adjustment factor has no effect on this ratio.
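The two ways of centring the replicate estimates can be compared directly. The sketch below is illustrative only: the replicate values are hypothetical, the function names are invented for this example, and the squared deviations are assumed to be averaged over B in both versions. It shows that centring on the full sample estimate adds the squared distance between that estimate and the average of the replicates, and it evaluates the size of the $(B - 1)/B$ adjustment on the standard error scale for B = 48.

```python
import math


def variance_about_full_estimate(replicates, full_estimate):
    """Mean squared deviation of the replicate estimates about the
    full-sample estimate (1/B normalisation assumed)."""
    B = len(replicates)
    return sum((t - full_estimate) ** 2 for t in replicates) / B


def variance_about_replicate_mean(replicates):
    """Mean squared deviation about the average of the replicates."""
    B = len(replicates)
    center = sum(replicates) / B
    return sum((t - center) ** 2 for t in replicates) / B


# Hypothetical replicate estimates and full-sample estimate.
reps = [10.2, 9.7, 10.5, 9.9, 10.1, 9.6]
theta_full = 10.1

v_full = variance_about_full_estimate(reps, theta_full)
v_mean = variance_about_replicate_mean(reps)
shift = (sum(reps) / len(reps) - theta_full) ** 2

# Centring on the full-sample estimate adds exactly the squared shift.
print(abs(v_full - (v_mean + shift)) < 1e-12)

# Adjustment factor on the standard-error scale for B = 48 replicates.
factor = math.sqrt(47 / 48)
print(round(factor, 4))
```

Because the same factor would multiply both the standard error with imputation variance and the one without, it cancels in their ratio, which is why leaving it out does not affect the comparisons in the tables.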



The variables used for this study include 6 categorical variables and 7 continuous
variables. Their stage 2 imputation (method (4)) rates range from 2 percent to 27 percent
(see table 1).



During stage 2 imputation, method (4), a hot deck method, was used to fill items that had 
missing values. The procedure started with the specification of imputation classes defined 
by certain relevant variables (matching variables). Then the records were sorted by 
STGROUP (Groups of states with similar schools) / STATE / TEALEVEL (Instructional 
level for teacher) / GRADELEV (Grade levels taught this year) / URB (Type of 
community where school located) / TEAFIELD (Teaching assignment field) /
ENROLMNT (Number of students enrolled in the school). The records were then treated 
sequentially. A nonmissing y-variable was used as a starting point for the process. If a 
record had a response for the y-variable, that value replaced the value previously stored 
for its imputation class. If the record had a missing response, it was assigned the value 
currently stored for its imputation class. If there was no donor in the class, the class was 
collapsed with another class. The matching variables and collapse order are listed in table 
7 and table 8. 
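The sequential procedure described above can be sketched as follows. This is a simplified illustration, not the SASS production code: the record fields and the function name are hypothetical, the records are assumed to be already sorted, and class collapsing is not implemented, so a missing value that precedes every donor in its class is simply left unfilled.

```python
def sequential_hot_deck(records, class_key, item):
    """Sequential hot deck over records that are already sorted.

    records: list of dicts; a missing item is stored as None.
    class_key: function mapping a record to its imputation class.
    Returns copies of the records with an added 'imputed' flag.
    """
    last_seen = {}  # most recently reported value for each class
    result = []
    for rec in records:
        cls = class_key(rec)
        new = dict(rec, imputed=False)
        if rec[item] is not None:
            # A respondent replaces the value stored for its class.
            last_seen[cls] = rec[item]
        elif cls in last_seen:
            # A nonrespondent receives the value currently stored.
            new[item] = last_seen[cls]
            new["imputed"] = True
        # Otherwise no donor is available yet; the survey procedure would
        # collapse classes here, which this sketch does not attempt.
        result.append(new)
    return result


# Hypothetical records, already in sort order.
records = [
    {"state": "A", "level": "elem", "salary": 30000},
    {"state": "A", "level": "elem", "salary": None},
    {"state": "A", "level": "sec", "salary": None},
    {"state": "A", "level": "sec", "salary": 35000},
    {"state": "A", "level": "sec", "salary": None},
]
filled = sequential_hot_deck(records, lambda r: (r["state"], r["level"]), "salary")
print([r["salary"] for r in filled])
```

Because the donor for each missing case is whichever respondent happens to precede it in the sort order, the imputed values change when the set of responding cases changes. That dependence on the sample is what the re-imputation in each bootstrap sub-sample is meant to capture.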



Most of the variables used for sorting or matching the records are not included in the data 
file; they had to be reconstructed by using other variables in the data file. This caused a 
discrepancy between the data imputed for this study and the original imputed data in the 
file. To prevent confounding the imputation difference with imputation variance, we 
imputed the full sample with our sorting and matching variables and denote this imputed 
full sample as s 0 . This is the sample used in the variance estimation (see imputation 
procedure step 3 above). 
In tables 2 through 6, we compare the standard errors that do not take the imputation
variance into account, $\mathrm{ste}(\hat{\theta})$, with the standard errors that incorporate the imputation
variance, $\mathrm{ste}_I(\hat{\theta})$. It is important to emphasize that both $\mathrm{ste}_I(\hat{\theta})$ and $\mathrm{ste}(\hat{\theta})$ are estimates
of standard errors rather than true standard errors, and therefore both of them are also
subject to sampling error.

Table 2 compares standard errors for the total estimator of continuous variables. The 
output shows the imputation does not inflate the variance for the total very much. For 
variable T0985, the standard error increases only 7 percent even though the imputation 
rate is as high as 27 percent. 

Table 3 compares standard errors for the average per person estimators of continuous
variables. The underlying estimator is actually a nonlinear estimator. When the
imputation rate is high, the inflation in the variance can be very high, too. For example,
for variable T0985, $\mathrm{ste}_I(\hat{\theta})$ is now 41 percent higher than $\mathrm{ste}(\hat{\theta})$. So if the imputed
data are treated as true values, the underestimation can be severe.

Table 4 compares standard errors for the total estimator of categorical variables. Here the
total estimates are estimated total counts in each category. Notice that the inflation in variance
is larger than for the total estimators of continuous variables. This might be due to the fact
that the sample sizes of the categorical variables are smaller (there is more legitimate
skipping for these items). The table also shows that as imputation rates get higher, the increase
in standard errors generally gets larger, although the increase is not exactly linear. Variable
T0040 shows the largest ratio: 2.04 (category 2).
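The ratios quoted above can be recomputed from the tabulated standard errors. The short sketch below does this for three entries taken from tables 2, 3, and 4; the function name `se_ratio` is introduced only for this example. Squaring each ratio gives the implied inflation on the variance scale.

```python
def se_ratio(ste_with_imputation, ste_naive):
    """Ratio of the standard error that includes imputation variance
    to the standard error that treats imputed values as observed."""
    return ste_with_imputation / ste_naive


# (naive ste, ste including imputation variance), copied from the tables.
cases = {
    "T0985 total (table 2)": (72049, 77165),
    "T0985 average (table 3)": (0.157, 0.222),
    "T0040 category 2 total (table 4)": (864, 1760),
}

for label, (naive, with_imputation) in cases.items():
    ratio = se_ratio(with_imputation, naive)
    # Squaring the ratio gives the implied inflation in the variance.
    print(f"{label}: ratio {ratio:.2f}, variance inflation {ratio ** 2:.2f}")
```

A ratio of 1.41 on the standard error scale corresponds to roughly a doubling of the variance, and a ratio of 2.04 to roughly a fourfold increase.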

Table 5 compares standard errors for the percentage estimators of discrete variables. Here 
the percentage is the estimated percent of units in each category. The underlying 
estimators are nonlinear estimators. The results are quite similar to those in table 4. 

Table 6 compares standard errors for the ratio estimators of continuous variables.
Variable BASIC is the ratio of teacher’s basic salary to teacher’s total income. Variable
INSCH is the ratio of teacher’s total income at school to teacher’s total income.
OUTSCH is the ratio of teacher’s total income from outside of school to teacher’s total
income. ADITION is the ratio of teacher’s other income from school (total income inside school
minus base salary) to teacher’s total income. IN_OUT is the ratio of teacher’s total income inside
school to teacher’s total income outside school. Although some variables used for the
ratios have high imputation rates (T1440, for example, has a 21.3 percent imputation rate), the
increase in standard errors is very small. Again, for continuous variables, we observed
smaller inflation in standard error.



4. Summary and Suggestion 

The techniques developed so far for the variance estimation of imputed data are not yet 
easy to implement or operationally convenient. Shao and Sitter’s method is appealing but 
requires repeated imputations, so for large scale surveys the data files become too large. 

However, our empirical study shows that using the hot deck imputation method in the 
1993-94 SASS can seriously affect the standard error. 

But notice that the majority of items have very low stage 2 (hot deck) imputation rates. 
For the SASS 1993-94 Public School Teacher component, only 11 out of 249 items had
stage 2 imputation rates above 10 percent (see Gruber, Rohr, and Fondelier, 1996, figure 
VHt-24, pp. 231-235). We used six of those items for this study. And, when the 
imputation rate is low, the inflation in variance is not severe, especially for continuous 
type variables. We believe it is feasible for NCES to compute the imputation inflation for
the total and ratio estimators for the few items that have high imputation rates and
document the problem in the next user’s manual. This will alert secondary users to the
possible magnitude of the imputation variance.



Table 1: Variables used in this study

| Name | Label | Stage 2 imputation rate (%) | Type |
|---|---|---|---|
| T0030 | 2 Full/Part-time teacher at this school | 11.8 | 5 Categories |
| T0035 | 3A Have other assignment at this sch | 9.8 | Dichotomous |
| T0040 | 3B What is other assignment at this sch | 24.0 | 6 Categories |
| T0140 | 11D Consecutive yrs teaching since break | 5.2 | Continuous |
| T0435 | 28A Any mathematics courses taken | 5.7 | Dichotomous |
| T0645 | 32B Programs changed views on teaching | 2.0 | 5 Categories |
| T0860 | 40B(4) Number of students in the class | 13.6 | Continuous |
| T0985 | 41C Number of separate classes taught | 27.0 | Continuous |
| T1420 | 53B(1) Academic yr base tchng salary | 8.3 | Continuous |
| T1430 | 53B(2) Additional compensation earned | 4.0 | Continuous |
| T1440 | 53B(3) Earning from job outside sch sys | 21.3 | Continuous |
| T1455 | 53B(5) Income earned from other source | 5.9 | Continuous |
| T1520 | 55 Total income of all HHD family member | 25.0 | 12 Categories |

Source: Abramson et al. (1996).



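Table 2: Standard error comparison for total estimates of continuous variables

| Name | Stage 2 imputation rate (%) | Estimate | $\mathrm{ste}(\hat{\theta})$ | $\mathrm{ste}_I(\hat{\theta})$ | $\mathrm{ste}_I(\hat{\theta})/\mathrm{ste}(\hat{\theta})$ |
|---|---|---|---|---|---|
| T0140 | 5.2 | 8985367 | 154697 | 153875 | 0.99 |
| T0860 | 13.6 | 24958128 | 411554 | 417361 | 1.01 |
| T0985 | 27.0 | 2107888 | 72049 | 77165 | 1.07 |
| T1420 | 8.3 | 86349560396 | 805679800 | 808307241 | 1.00 |
| T1430 | 4.0 | 1865774738 | 36016613 | 37220591 | 1.03 |
| T1440 | 21.3 | 2179435663 | 87253029 | 89579851 | 1.03 |
| T1455 | 5.9 | 588847739 | 20784683 | 20928990 | 1.01 |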

Table 3: Standard error comparison for average estimates of continuous variables 

Name    Stage 2         Estimate*   ste(\hat\theta)   ste_I(\hat\theta)   ste_I(\hat\theta) /
        imputation                                                        ste(\hat\theta)
        rate (%)
T0140        5.2            11.01             0.085               0.082       0.96
T0860       13.6            22.79             0.077               0.085       1.10
T0985       27.0            12.79             0.157               0.222       1.41
T1420        8.3         33713.26            88.146              89.404       1.01
T1430        4.0          2093.88            28.232              29.667       1.05
T1440       21.3          4384.44           161.861             170.351       1.05
T1455        5.9          1676.05            48.636              50.182       1.03

* These estimates are averages per teacher. 
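
The ratio column of Table 3 can be recomputed from its two standard error columns. The following is a minimal Python sketch of that arithmetic, using the values transcribed from Table 3. It assumes that the second standard error column is the one that accounts for imputation; the names `TABLE3`, `inflation`, and `summarize` are illustrative and do not come from the study. The squared ratio is reported as an implied variance inflation, which is a derived quantity that the tables themselves do not list.

```python
# Values transcribed from Table 3 (averages per teacher).
# name -> (stage 2 imputation rate in %, ste ignoring imputation,
#          ste accounting for imputation [assumed interpretation])
TABLE3 = {
    "T0140": (5.2, 0.085, 0.082),
    "T0860": (13.6, 0.077, 0.085),
    "T0985": (27.0, 0.157, 0.222),
    "T1420": (8.3, 88.146, 89.404),
    "T1430": (4.0, 28.232, 29.667),
    "T1440": (21.3, 161.861, 170.351),
    "T1455": (5.9, 48.636, 50.182),
}


def inflation(ste, ste_imputed):
    """Return (standard error ratio, implied variance ratio)."""
    ratio = ste_imputed / ste
    return ratio, ratio ** 2


def summarize(table):
    """Return rows (name, rate, se ratio, variance ratio) sorted by rate."""
    rows = []
    for name, (rate, ste, ste_imputed) in table.items():
        ratio, var_ratio = inflation(ste, ste_imputed)
        rows.append((name, rate, round(ratio, 2), round(var_ratio, 2)))
    return sorted(rows, key=lambda row: row[1])


if __name__ == "__main__":
    for name, rate, ratio, var_ratio in summarize(TABLE3):
        print(f"{name}  rate={rate:5.1f}%  se ratio={ratio:.2f}  "
              f"variance ratio={var_ratio:.2f}")
```

Sorting by imputation rate makes it easy to see whether the inflation grows with the rate, which is the pattern the discussion above describes.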






Table 4: Standard error comparison for total estimates of discrete variables 

Name    Stage 2      Categories    Estimate   ste(\hat\theta)   ste_I(\hat\theta)   ste_I(\hat\theta) /
        imputation                                                                  ste(\hat\theta)
        rate (%)
T0030       11.8          1           12994              1662                1835       1.10
                          2           31489              2190                2502       1.14
                          3           97607              3719                4156       1.12
                          4           52767              2583                2871       1.11
                          5           36706              1993                2748       1.38
T0035        9.8          1           54006              1969                2130       1.08
                          2          166845              4162                4161       1.00
T0040       24.0          1            9613              1210                1739       1.44
                          2           11737               864                1760       2.04
                          3            5093               803                1015       1.26
                          4           12311               849                1465       1.73
                          5           26962              1844                2335       1.27
                          6            5543               715                1158       1.62
T0435        5.7          1         2001004             17316               17157       0.99
                          2          560289              8838                8807       1.00
T0645        2.0          1          122310              4354                4298       0.99
                          2          822249             10566               10638       1.01
                          3          498908              8204                8187       1.00
                          4          711355             10300               10452       1.01
                          5          103472              3174                3105       0.98
T1520       25.0          1             173                57                  82       1.45
                          2             863               185                 301       1.63
                          3            8850               698                 723       1.03
                          4           72952              2592                3045       1.18
                          5          123771              4036                4804       1.19
                          6          154036              3771                4152       1.10
                          7          174850              4497                5301       1.18
                          8          404821              6425                7594       1.18
                          9          434259              8408                9091       1.08
                         10          523142              8156               10362       1.27
                         11          438739              8604                9664       1.12
                         12          224836              5327                6480       1.22



Table 5: Standard error comparison for percentage estimates of discrete variables 

Name    Stage 2      Categories   Estimate (%)   ste(\hat\theta)   ste_I(\hat\theta)   ste_I(\hat\theta) /
        imputation                                                                     ste(\hat\theta)
        rate (%)
T0030       11.8          1               5.61             0.691               0.763       1.10
                          2              13.60             0.838               0.991       1.18
                          3              42.15             1.383               1.645       1.19
                          4              22.79             1.019               1.150       1.13
                          5              15.85             0.882               1.195       1.35
T0035        9.8          1              24.45             0.775               0.842       1.09
                          2              75.55             0.775               0.842       1.09
T0040       24.0          1              13.49             1.549               2.392       1.54
                          2              16.47             1.169               2.443       2.09
                          3               7.15             1.098               1.411       1.29
                          4              17.28             1.227               2.038       1.66
                          5              37.84             1.861               2.835       1.52
                          6               7.78             0.912               1.562       1.71
T0435        5.7          1              78.12             0.284               0.279       0.98
                          2              21.88             0.284               0.279       0.98
T0645        2.0          1               5.42             0.191               0.188       0.98
                          2              36.41             0.359               0.364       1.01
                          3              22.09             0.283               0.291       1.03
                          4              31.50             0.339               0.341       1.01
                          5               4.58             0.136               0.132       0.97
T1520       25.0          1               0.01             0.002               0.003       1.60
                          2               0.03             0.007               0.012       1.68
                          3               0.35             0.027               0.028       1.04
                          4               2.85             0.099               0.114       1.15
                          5               4.83             0.145               0.176       1.22
                          6               6.01             0.133               0.149       1.12
                          7               6.83             0.172               0.199       1.16
                          8              15.81             0.215               0.280       1.30
                          9              16.95             0.291               0.332       1.14
                         10              20.42             0.292               0.368       1.26
                         11              17.13             0.293               0.349       1.19
                         12               8.78             0.204               0.248       1.21



Table 6: Standard error comparison for ratio estimates of continuous variables 

Basic    = T1420/(T1420 + T1430 + T1440 + T1455) 
Insch    = (T1420 + T1430)/(T1420 + T1430 + T1440 + T1455) 
Outsch   = T1440/(T1420 + T1430 + T1440 + T1455) 
Addition = T1430/(T1420 + T1430 + T1440 + T1455) 
In_out   = (T1420 + T1430)/(T1440 + T1455) 

Name        Stage 2 imputation   Estimate   ste(\hat\theta)   ste_I(\hat\theta)   ste_I(\hat\theta) /
            rate (%)                                                              ste(\hat\theta)
Basic       --                   0.94907    0.000966          0.000977            1.01
Insch       --                   0.96957    0.000909          0.000938            1.03
Outsch      --                   0.02395    0.0008819         0.0009020           1.02
Addition    --                   0.02051    0.0003578         0.0003757           1.05
In_out      --                   31.87      0.9823            1.010               1.03
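
The ratio estimates in Table 6 are functions of the four weighted totals reported in Table 2, so they can be checked directly. The sketch below is a hypothetical helper, not code from the study: `TOTALS` holds the Table 2 estimates and `ratio_estimates` applies the five definitions listed above Table 6.

```python
# Weighted totals transcribed from the "Estimate" column of Table 2.
TOTALS = {
    "T1420": 86349560396,  # academic year base teaching salary
    "T1430": 1865774738,   # additional compensation earned
    "T1440": 2179435663,   # earnings from job outside school system
    "T1455": 588847739,    # income earned from other sources
}


def ratio_estimates(totals):
    """Apply the Table 6 definitions to a dict of weighted totals."""
    all_income = sum(totals[name] for name in ("T1420", "T1430", "T1440", "T1455"))
    in_school = totals["T1420"] + totals["T1430"]
    out_school = totals["T1440"] + totals["T1455"]
    return {
        "Basic": totals["T1420"] / all_income,
        "Insch": in_school / all_income,
        "Outsch": totals["T1440"] / all_income,
        "Addition": totals["T1430"] / all_income,
        "In_out": in_school / out_school,
    }


if __name__ == "__main__":
    for name, value in ratio_estimates(TOTALS).items():
        print(f"{name:9s} {value:.5f}")
```

Only the point estimates can be reproduced this way; the standard errors in Table 6 depend on the replicate data and cannot be derived from the totals alone.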



Table 7: Public School Teacher (SASS-4A) matching variables 



Items                        Matching variables

T0030, T0035, T0040          STGROUP, STATE, TEALEVEL, URB, ENR
T0140                        STGROUP, STATE, TEALEVEL, AGE, HIGHDEG
T0435, T0645                 STGROUP, STATE, TEALEVEL, HIGHDEG, TEAEXPER
T0860                        STGROUP, TEALEVEL
T0985                        STGROUP, STATE, TEALEVEL, FULPTIME, TEAEXPER
T1420, T1430, T1440, T1455   STGROUP, STATE, TEALEVEL, URB, HIGHDEG, TEAEXPER
T1520                        STGROUP, STATE, TEALEVEL, URB, HIGHDEG, TEAEXPER


Source: Gruber, Rohr, and Fondelier (1996), figure VIII-28. 



Table 8: Public School Teacher (SASS-4A) order of collapse 


Items                        Order of collapse

T0030, T0035, T0040          ENR, URB, STATE
T0140                        HIGHDEG, AGE, STATE
T0435, T0645                 TEAEXPER, HIGHDEG, STATE
T0860                        TEALEVEL
T0985                        TEAEXPER, FULPTIME, STATE
T1420, T1430, T1440, T1455   TEAEXPER, HIGHDEG, STATE
T1520                        TEAEXPER, HIGHDEG, TEALEVEL


Source: Gruber et al. (1996), figure VIII-28. 
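
To illustrate how matching variables and an order of collapse might interact in a hot deck, the following Python sketch searches for donors that agree with the recipient on every matching variable and, when none is found, relaxes the match one variable at a time in the listed order. This is a simplification made for illustration: it assumes that collapsing means dropping a variable entirely, whereas the actual SASS procedure may instead merge categories of that variable. The function names and the records are hypothetical; only the variable lists for T0030 are taken from Tables 7 and 8.

```python
import random

# Matching variables and order of collapse for item T0030 (Tables 7 and 8).
MATCH_T0030 = ["STGROUP", "STATE", "TEALEVEL", "URB", "ENR"]
COLLAPSE_T0030 = ["ENR", "URB", "STATE"]


def find_donors(recipient, donors, match_vars, collapse_order):
    """Return (donor pool, matching variables actually used).

    Assumption: collapsing a variable means ignoring it when matching.
    """
    active = list(match_vars)
    pending = list(collapse_order)
    while True:
        pool = [d for d in donors
                if all(d[v] == recipient[v] for v in active)]
        if pool or not pending:
            return pool, active
        dropped = pending.pop(0)
        if dropped in active:
            active.remove(dropped)


def hot_deck_value(recipient, donors, item, match_vars, collapse_order, rng=None):
    """Pick a donor at random from the pool and return its value for item."""
    rng = rng or random.Random(0)
    pool, used = find_donors(recipient, donors, match_vars, collapse_order)
    if not pool:
        return None, used
    return rng.choice(pool)[item], used


# Hypothetical records, invented for this example only.
donors = [
    {"STGROUP": 1, "STATE": "CA", "TEALEVEL": "E", "URB": 1, "ENR": 3, "T0030": 2},
    {"STGROUP": 1, "STATE": "TX", "TEALEVEL": "E", "URB": 2, "ENR": 1, "T0030": 5},
]
recipient = {"STGROUP": 1, "STATE": "CA", "TEALEVEL": "E", "URB": 1, "ENR": 2}

if __name__ == "__main__":
    value, used = hot_deck_value(recipient, donors, "T0030",
                                 MATCH_T0030, COLLAPSE_T0030)
    print("imputed T0030 =", value, "matched on", used)
```

In this toy data set no donor matches the recipient on all five variables, so the search drops ENR first and then finds a single donor.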
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Appendix 

This appendix presents results we derived for Kaufman’s method. In Kaufman’s method, 
a residual is defined for each $k \in s$: 

$$
d_k = \begin{cases} 0 & \text{if } k \in r \\ (2I_k - 1)(y_{j_k 2} - y_{j_k 1}) & \text{if } k \in nr \end{cases}
$$

where $j_k$ is a missing item within missing item $k$'s imputation cell and 
$(y_{j_k 2} - y_{j_k 1})$ is the difference between the two nearest neighbors (donors) 
of $j_k$. Missing item $j_k$ is selected independently from $k$'s imputation cell with 
known selection probability; for example, with selection probability proportional to 
design weights $w_k$. Then the residual is attached to $y_k$ to form another quantity 
$\tilde{Y}$ used for the purpose of variance estimation: 

$$
\tilde{Y} = \hat{y}_{\bullet} + d_R = \sum_{k \in s} w_k (y_k + d_k).
$$

The variance of $\tilde{Y}$ contains variability from both $\hat{y}_{\bullet}$ and $d_R$. 

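Lemma 1. Let $\hat{y}_{\bullet} = \sum_{k \in s} w_k y_k$ and 
$\bar{y}_{\bullet} = \tfrac{1}{2}(y_1 + y_2)$. Here 
$y_1 = \sum_{k \in r} w_k y_k + \sum_{k \in nr} w_k y_{k1}$, 
$y_2 = \sum_{k \in r} w_k y_k + \sum_{k \in nr} w_k y_{k2}$, and $E_2$ is with respect to 
the imputation selection and the residual selection. Then 
$E_2(\hat{y}_{\bullet}) = \bar{y}_{\bullet}$. 

Proof: 

$$
\begin{aligned}
E_2(\hat{y}_{\bullet}) &= E_2\Big(\sum_{k \in s} w_k y_k\Big) = \sum_{k \in r} w_k y_k + \sum_{k \in nr} w_k E_2\big[y_{k1} I_k + y_{k2}(1 - I_k)\big] \\
&= \sum_{k \in r} w_k y_k + \sum_{k \in nr} w_k (0.5\, y_{k1} + 0.5\, y_{k2}) = \tfrac{1}{2}(y_1 + y_2) \\
&= \bar{y}_{\bullet}. \qquad \blacksquare
\end{aligned}
$$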

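Lemma 2. Let $\tilde{Y} = \hat{y}_{\bullet} + d_R$. Here $d_R = \sum_{k \in s} w_k d_k$ and 

$$
d_k = \begin{cases} 0 & \text{if } k \in r \\ (2I_k - 1)(y_{j_k 2} - y_{j_k 1}) & \text{if } k \in nr. \end{cases}
$$

Then $E_2(\tilde{Y}) = \bar{y}_{\bullet}$. 

Proof: Since $E_2(\tilde{Y}) = E_2(\hat{y}_{\bullet}) + E_2(d_R)$, we only need to show 
$E_2(d_R) = 0$. Actually 

$$
E_2(d_R) = \sum_{k \in nr} w_k E_2\big[(2I_k - 1)(y_{j_k 2} - y_{j_k 1})\big] = 0.
$$

Combine Lemma 1 and Lemma 2: we can see $E_2(\tilde{Y}) = E_2(\hat{y}_{\bullet})$. 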
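
Lemmas 1 and 2 can be checked numerically on a small example. The sketch below is an illustration under stated assumptions, not part of the original derivation: it assumes each imputation indicator equals 1 or 0 with probability 1/2 independently, holds the residual donor differences fixed, and enumerates every indicator configuration exactly. The function name `check_lemmas` and all numeric values are hypothetical.

```python
from itertools import product


def check_lemmas(respondents, nonrespondents):
    """Enumerate all imputation-indicator configurations.

    respondents:    list of (w, y) pairs
    nonrespondents: list of (w, y1, y2, diff) tuples, where y1 and y2 are
                    the two candidate donor values and diff is the fixed
                    residual donor difference (an assumption of this sketch)
    Returns (y_bar, mean of imputed estimator, mean of augmented estimator).
    """
    base = sum(w * y for w, y in respondents)
    y1 = base + sum(w * a for w, a, _, _ in nonrespondents)
    y2 = base + sum(w * b for w, _, b, _ in nonrespondents)
    y_bar = 0.5 * (y1 + y2)

    configs = list(product((0, 1), repeat=len(nonrespondents)))
    mean_imputed = 0.0
    mean_augmented = 0.0
    for indicators in configs:
        # Imputed estimator: donor 1 if the indicator is 1, donor 2 otherwise.
        y_dot = base + sum(w * (a * i + b * (1 - i))
                           for i, (w, a, b, _) in zip(indicators, nonrespondents))
        # Weighted sum of residuals d_k = (2 I_k - 1) * diff.
        d_r = sum(w * (2 * i - 1) * diff
                  for i, (w, _, _, diff) in zip(indicators, nonrespondents))
        mean_imputed += y_dot / len(configs)
        mean_augmented += (y_dot + d_r) / len(configs)
    return y_bar, mean_imputed, mean_augmented


if __name__ == "__main__":
    resp = [(2.0, 10.0), (1.5, 12.0)]
    nonresp = [(1.0, 9.0, 11.0, 3.0), (2.5, 14.0, 8.0, -1.0), (1.2, 7.0, 7.5, 0.5)]
    print(check_lemmas(resp, nonresp))
```

All three returned values should agree up to floating-point error, since averaging over the indicators removes the residual term and replaces each imputed value with the mean of its two candidate donors.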

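Lemma 3. $V_1 E_2(\tilde{Y}) = V_1(\bar{y}_{\bullet} - \hat{y}) + V_1(\hat{y}) + 2\,COV_1(\bar{y}_{\bullet} - \hat{y}, \hat{y})$. 
Here $V_1$ is with respect to the sample design and the nonresponse mechanism. 

Proof: 

$$
\begin{aligned}
V_1 E_2(\tilde{Y}) &= V_1 E_2(\hat{y}_{\bullet}) && \text{(Lemma 2)} \\
&= V_1(\bar{y}_{\bullet}) && \text{(Lemma 1)} \\
&= V_1(\bar{y}_{\bullet} - \hat{y} + \hat{y}) \\
&= V_1(\bar{y}_{\bullet} - \hat{y}) + V_1(\hat{y}) + 2\,COV_1(\bar{y}_{\bullet} - \hat{y}, \hat{y}).
\end{aligned}
$$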



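Lemma 4. $\hat{y}_{\bullet}$ and $d_R$ are uncorrelated with respect to imputation 
selection and residual selection; that is, 
$V_2(\hat{y}_{\bullet} + d_R) = V_2(\hat{y}_{\bullet}) + V_2(d_R)$. 

Proof: Let subscript $I$ denote the imputation selection and subscript $R$ the residual 
selection. Notice that 

$$
\begin{aligned}
V_2(\hat{y}_{\bullet} + d_R) &= E_I V_R(\hat{y}_{\bullet} + d_R) + V_I E_R(\hat{y}_{\bullet} + d_R) = E_I V_R(d_R) + V_I\big[E_R(\hat{y}_{\bullet}) + E_R(d_R)\big] \\
&= E_I V_R(d_R) + V_I\big[\hat{y}_{\bullet} + E_R(d_R)\big] \\
&= E_I V_R(d_R) + V_I(\hat{y}_{\bullet}) + V_I E_R(d_R) + 2\,COV_I\big(\hat{y}_{\bullet}, E_R(d_R)\big),
\end{aligned}
$$

while 

$$
V_2(\hat{y}_{\bullet}) + V_2(d_R) = V_I(\hat{y}_{\bullet}) + E_I V_R(d_R) + V_I E_R(d_R).
$$

Therefore, we only need to show $2\,COV_I\big(\hat{y}_{\bullet}, E_R(d_R)\big) = 0$. Notice 

$$
\hat{y}_{\bullet} = \sum_{k \in r} w_k y_k + \sum_{k \in nr} w_k \big(y_{k1} I_k + y_{k2}(1 - I_k)\big)
$$

and 

$$
E_R(d_R) = \sum_{h \in nr} w_h E_R(d_h) = \sum_{h \in nr} w_h (2I_h - 1)\mu_h.
$$

Here $\mu_h = E_R(y_{j_h 2} - y_{j_h 1})$. Also notice 

$$
\begin{aligned}
& E_I\Big\{\Big[\sum_{k \in r} w_k y_k + \sum_{k \in nr} w_k\big(y_{k1} I_k + y_{k2}(1 - I_k)\big)\Big]\Big[\sum_{h \in nr} w_h (2I_h - 1)\mu_h\Big]\Big\} \\
&= E_I\Big[\sum_{k \in r}\sum_{h \in nr} w_k y_k w_h (2I_h - 1)\mu_h + \sum_{k \in nr}\sum_{h \in nr} w_k\big(y_{k1} I_k + y_{k2}(1 - I_k)\big) w_h (2I_h - 1)\mu_h\Big] \\
&= 0 + \sum_{k \in nr}\sum_{h \in nr} w_k w_h \mu_h \big[y_{k1} E_I(2 I_k I_h - I_k) + y_{k2} E_I(1 - I_k)(2I_h - 1)\big] \\
&= 0,
\end{aligned}
$$

and 

$$
E_I\Big[\sum_{h \in nr} w_h (2I_h - 1)\mu_h\Big] = 0;
$$

therefore, 

$$
COV_I\big(\hat{y}_{\bullet}, E_R(d_R)\big) = 0.
$$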



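Lemma 5. $V(\hat{y}) + V(\hat{y} - \hat{y}_{\bullet}) = V_1(\hat{y}) + V_1(\hat{y} - \bar{y}_{\bullet}) + E_1 V_2(\hat{y}_{\bullet})$. 
Here subscript 1 is with respect to the sampling design and nonresponse mechanism, 
subscript 2 is with respect to the imputation selection and the residual selection. No 
subscript is with respect to all variance components. 

Proof: 

$$
\begin{aligned}
V(\hat{y}) &= V_1 E_2(\hat{y}) + E_1 V_2(\hat{y}) = V_1(\hat{y}), \\
V(\hat{y} - \hat{y}_{\bullet}) &= V_1 E_2(\hat{y} - \hat{y}_{\bullet}) + E_1 V_2(\hat{y} - \hat{y}_{\bullet}) \\
&= V_1(\hat{y} - \bar{y}_{\bullet}) + E_1 V_2(\hat{y}_{\bullet}). && \text{(Lemma 1)}
\end{aligned}
$$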



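Theorem 1. $V(\tilde{Y}) = V(\hat{y}) + V(\hat{y} - \hat{y}_{\bullet}) + 2\,COV_1(\bar{y}_{\bullet} - \hat{y}, \hat{y}) + E_1 V_2(d_R)$. 

Proof: 

$$
\begin{aligned}
V(\tilde{Y}) &= V_1 E_2(\tilde{Y}) + E_1 V_2(\tilde{Y}) \\
&= V_1(\bar{y}_{\bullet} - \hat{y}) + V_1(\hat{y}) + 2\,COV_1(\bar{y}_{\bullet} - \hat{y}, \hat{y}) + E_1 V_2(\hat{y}_{\bullet} + d_R) && \text{(Lemma 3)} \\
&= V_1(\bar{y}_{\bullet} - \hat{y}) + V_1(\hat{y}) + 2\,COV_1(\bar{y}_{\bullet} - \hat{y}, \hat{y}) + E_1 V_2(\hat{y}_{\bullet}) + E_1 V_2(d_R) && \text{(Lemma 4)} \\
&= V(\hat{y}_{\bullet} - \hat{y}) + V(\hat{y}) + 2\,COV_1(\bar{y}_{\bullet} - \hat{y}, \hat{y}) + E_1 V_2(d_R). && \text{(Lemma 5)}
\end{aligned}
$$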

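Theorem 2. $V(\tilde{Y}) = V(\hat{y}_{\bullet}) + E_1 V_2(d_R)$. 

Proof: 

$$
\begin{aligned}
V(\tilde{Y}) &= V_1 E_2(\tilde{Y}) + E_1 V_2(\tilde{Y}) \\
&= V_1 E_2(\hat{y}_{\bullet}) + E_1 V_2(\hat{y}_{\bullet}) + E_1 V_2(d_R) && \text{(Lemmas 1, 2, and 4)} \\
&= V(\hat{y}_{\bullet}) + E_1 V_2(d_R).
\end{aligned}
$$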

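Theorem 3. $V(\hat{y}_{\bullet}) = V(\hat{y}) + V(\hat{y} - \hat{y}_{\bullet}) + 2\,COV_1(\bar{y}_{\bullet} - \hat{y}, \hat{y})$. 

Proof follows directly from Theorem 1 and Theorem 2: both give $V(\tilde{Y})$, so 
equating their right-hand sides and cancelling $E_1 V_2(d_R)$ yields the result. 

The result in Theorem 3 is actually the same as formula (4) of section 2.2, as it should be. 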